Abstract. The designers of virtual agents often draw on a large research 
literature in psychology, linguistics and human ethology to design embodied 
agents that can interact with people. In this paper, we consider a structural 
acting system developed by Francois Delsarte as a possible resource in 
designing the nonverbal behavior of embodied agents. Using human subjects, 
we evaluate one component of the system, Delsarte's Cube, that addresses the 
meaning of differing attitudes of the hand in gestures. 
1 Introduction 

At a recent forum on emotion at the Swiss National Exchange, Leonard Pitt, an expert 
in mime and the use of masks, performed in front of an audience of assembled 
emotion researchers. He donned masks with fixed facial expressions and then 
proceeded to explore how alterations of posture and gesture could manipulate our 
impression of the emotional/attitudinal state of his character and override the emotion 
expressed by the mask. The performance served to demonstrate poignantly the 
panoply of behaviors that convey personality and emotion. Although it was not the 
performer's intent, it also demonstrated that designers of virtual humans have a long 
way to go in creating such expressivity in their behavioral models and animations. 

Designers of virtual humans have been very effective in mining the large literature in 
psychology that has studied such phenomena. Of notable distinction is the work of 
Ekman & Friesen (1978), along with their collaborators, on the facial action coding 
system (FACS) that breaks down facial expression into action units. FACS has 
provided a systematic, exacting basis for emotion researchers to study and catalog the 
expressive capabilities of the face and its role in human interaction. Not surprisingly, 
the FACS has also helped to spur significant advances in the facial expressions of 
ECAs (Embodied Conversational Agents, also known as virtual humans). The 
systematic coding approach provided by FACS allows for the decomposition and 
enumeration of the space of all possible facial expressions (though not all are 
anatomically possible). For example, armed with such descriptions, the designer of an 
ECA can build a head that covers that space as well as explore manipulations of the 
compositions and dynamics of action units. 



There has also been extensive work on posture (e.g., Mehrabian, 1969; Walbott, 
1998), gesture (Kendon, 2004; McNeill, 1992), gaze (Argyle & Cook, 1976), etc. 
Unfortunately, this work has typically not matched the systematic level of analysis 
that the facial expression research has achieved. In particular, the psychological 
research in these areas has provided far less guidance in the cataloging of the space of 
all possible head movements or how to exploit and understand the dynamic manner of 
such movements. 

Beyond the psychological literature, virtual human researchers can also draw upon a 
variety of resources from the performance arts, including acting technique, 
choreography and rhetorical gesture. Structural acting systems, for example, propose 
that certain behaviors and the manner in which they are performed convey particular 
meanings, as in the work of Laban (see Newlove, 1999) and Delsarte (see Zorn, 1968). 
This work has in fact informed virtual human research (Noma et al., 2000; Costa et al., 
2000; Neff & Fiume, 2004). On the other hand, some acting theories are not directly 
relevant to our work. For example, Stanislavsky's acting technique (Stanislavski, 
1989) instructs the performer to put herself in the mental and emotional state of her 
role; the appropriate physical performance will then naturally follow. This clearly 
is less relevant to virtual human work, since our virtual humans do not come equipped 
with a model that maps between internal state and behavior. Indeed, it is this mapping 
that we seek to create. 

The analysis technique of Delsarte is particularly intriguing for virtual human 
researchers. This technique systematically and extensively describes how emotions, 
attitude and personality are conveyed in dynamic body postures and gestures. 
Delsarte's work is based on his extensive observations of human behavior across a 
range of ecologically varied settings. Unfortunately, the description of the technique 
is often couched in language and terminology from the 1800s that strikes a 21st-century 
reader as perhaps quaint and metaphysical. More importantly, Delsarte's 
observations have not been empirically confirmed. 

It is in this context that we have been exploring Delsarte's work and asking very basic 
questions. Are the interpretations that people derive from Delsarte's catalog of 
movements consistent with Delsarte's analysis or at least reliable across observers? In 
this paper, we begin to address these questions with some preliminary human 
experiments targeting what Delsarte describes as attitudes of the hand. The results of 
this pilot study suggest that Delsarte's work deserves some closer scrutiny. For the 
reader who is not familiar with the Delsarte technique, we will present a brief primer 
on its relevant features prior to the description of our method and the presentation of 
results. 



2.0 Acting Methods and Delsarte 



Acting operates simultaneously on two levels. The opaque level pertains to the body 
and voice of the actor; we generally associate this level with the skills and virtuosity 
of the acting craft. The transparent level pertains to the stories and emotions revealed, 
as conveyed through the actor's body and voice. In short, the audience looks at the 
opaque body of the actor in order to see through that body a character. 

Contemporary actors usually work from the transparent level and assume that their 
trained bodies and voices will automatically follow. They generally do not think 
about their bodies when they perform. Expressivity will come as a natural 
consequence of establishing the proper internal state. If actors are trained in the 
Stanislavsky System, they begin by asking a simple question. What would I do, if I 
were in the circumstances of my character? If actors are trained in the American 
Method, they ask themselves a slightly different question. What is it in my own life 
that makes me feel the way my character must feel in this scene? In both 
cases, the contemporary actor places himself or herself imaginatively in the fictional 
world of the character and then behaves as that world dictates (see Carnicke, 1998, for 
a comparison of the two approaches). 

Unfortunately, contemporary acting practice does not help much in developing a 
computer model for encoding emotionally realistic physical gestures in virtual 
characters. Virtual humans do not come equipped with mappings from 
internal cognitive and emotional states to behavior that can be expected to motivate 
their movements. As Hooks (2000) writes, "Actors create emotion - largely internally 
- in the present moment, while animators describe internal emotion through the 
external movement of their characters." Moreover, the goals of the virtual human 
designer (to describe movement, to break it down into its multiple components of 
spatial shapes, temporal rhythms, force and direction, and to collect data about the 
encoding and decoding of emotion through the body) are arguably antithetical to the 
artistic work of the contemporary actor. The contemporary actor allows whatever 
happens in the moment to occur without intellectually judging it or analyzing it. 

In order to address our goals we decided to go back to the past for ways to think about 
acting — to the time when actors worked more consciously with the opaque level of 
performance. We thus returned to the techniques of gestural or structural acting. 

2.1 Introduction to Delsarte 

Francois Delsarte (1811-1871) was a French singer who had lost his voice 
because of poor teaching practices. He began to study the relationships between 
physical behavior, emotion, and language in order to formulate scientific principles of 
expression. Over many years, he diligently observed the expressive postures and 
gestures of living people across all walks of life as well as corpses in the Paris 
morgue, developing a broad model of how emotion, body, and language interact. 



Delsarte became the most significant acting teacher in Europe and one of his students, 
Steele MacKaye, became the leading force in American actor training. MacKaye 
studied with Delsarte in Paris in 1869, became his assistant in 1870, and founded the 
first professional acting school in the United States in 1884. Thus, Delsarte provided 
the predominant form of actor training in the United States until 1923 when 
Stanislavsky first brought the Moscow Art Theater to the United States. 

Delsarte saw that movement involves a "semiotics" (Delsarte's own term), a sign 
system that can be "read" by observers. Thus, the body encodes meaning which the 
viewer can decode. He recognized that physical "signs" come from various sources: 
some gestures are ours alone and express our individuality; some are social or cultural 
conventions, like waving "hello"; and some may be biologically connected to our 
emotional reactions (from Delsarte's Rhetoric, e.g. see Zorn, 1968). 

The complexity of Delsarte's system is both daunting and a potential advantage. It 
specifies a vast range of potential gestures and postures by working through a series 
of principles that defines a space of variations across head orientations, stances, hand 
shapes, leg positions and arm orientations as well as the meanings they convey. 
Further, Delsarte argues that the zones of the body and the space around the body tend to 
be associated with differing intellectual, emotional and physical interpretations. 

In other words, Delsarte's system provides a potential enumeration of behaviors, 
both in terms of orientations and movements through space. For example, consider 
head movement. We know from linguistic studies (e.g., McClave, 2000) that a head 
shake or sweep may signify inclusivity ("everyone"), and that a tilt upward that averts 
gaze can signify an effort to think or regulate cognitive load (Argyle & Cook, 1976); 
ethologists tell us that a tilt to the side can signify flirting behavior. Pieced together, 
these studies can greatly assist in the design of virtual humans. However, the resulting 
picture is piecemeal. Delsarte, on the other hand, lays out all possible head tilts and what they 
could signify. Similarly, consider gestures. Delsarte suggests that the orientation of 
the gesture in space, the shape of the hand and fingers, and the starting and ending 
locations of its movement all affect what the gesture signifies. Overall, this provides 
a considerable amount of raw material for designing virtual humans. 

Other more recent physically based actor training systems (such as that of Laban) 
reduce the number of distinct movements that actors are expected to study. As Hecht 
observes (1971), physical training systems for the actor that come after Delsarte get 
progressively simpler. For the actor, they seem more manageable, hence more useful, 
but as a consequence they are also limiting for our purpose. 

The specifics of Delsarte's system are too extensive to cover in any detail. We 
therefore confine ourselves to a few brief comments about his work. 



2.2 Attitudes of the Hand 



In describing attitudes of the hand, Delsarte talks of an imaginary cube in front of the 
speaker. Consider grasping it from each possible surface: use two hands to contain 
the outer surfaces; push a hand outward against its inner surface; bring a hand upward 
to its lower surface; explore every possible way of grasping and containing this cube. 
Each gesture has a different connotation. 

Experts on Delsarte differ in the details of what these various positions of the hand 
signify. There are several sources that describe the cube (e.g., Delaumosne in Zorn, 
1968; Shawn, 1963). The basic intuitions of the cube seem on target to us. To address 
some of the discrepancies across interpretations of Delsarte's teaching, we 
synthesized the various approaches into the following hypotheses of how they would 
be interpreted: (The one marked with a plus is a hand posture suggested by the 
authors.) 

• Palm of Hand on face of cube farthest away from body => to limit 

• Palm of Hand on interior of face nearest body => to possess + 

• Palm of Hand on face of cube nearest body => to stop 

• Palm of Hand on side surface of cube => to possess, include 

• Palm of Hand on interior side surface of cube => to reject, remove 

• Palm of Hand on top surface of cube => to control 

• Palm of Hand on bottom surface of cube => to support 
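
For a designer who wants to use these hypotheses in a behavior generator, the list above can be written down as a small lookup table. The following is a minimal sketch only: the class name, field names and the `predicted_meanings` helper are illustrative assumptions, not part of Delsarte's system or of the study's software.

```python
from typing import NamedTuple, Tuple

class CubeHypothesis(NamedTuple):
    """One hypothesized reading of a palm position on the imaginary cube."""
    face: str                      # face of the cube relative to the body
    interior: bool                 # True if the palm is on the inside of that face
    meanings: Tuple[str, ...]      # hypothesized interpretation(s)
    added_by_authors: bool = False # the "+" entry in the list above

# Transcribed from the seven hypotheses listed above (labels are illustrative).
HYPOTHESES = [
    CubeHypothesis("farthest", False, ("limit",)),
    CubeHypothesis("nearest", True, ("possess",), added_by_authors=True),
    CubeHypothesis("nearest", False, ("stop",)),
    CubeHypothesis("side", False, ("possess", "include")),
    CubeHypothesis("side", True, ("reject", "remove")),
    CubeHypothesis("top", False, ("control",)),
    CubeHypothesis("bottom", False, ("support",)),
]

def predicted_meanings(face: str, interior: bool = False) -> Tuple[str, ...]:
    """Return the hypothesized meanings for a palm position, or () if none is listed."""
    for hypothesis in HYPOTHESES:
        if hypothesis.face == face and hypothesis.interior == interior:
            return hypothesis.meanings
    return ()

print(predicted_meanings("side", interior=True))
```

Keeping the interior/exterior distinction as an explicit field makes it easy to add the further "push through" cases explored in the evaluation.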



2.3 The Three Orders of Movement 

Delsarte moves beyond the static systems of gestural actor training that were used in 
the 17th and 18th centuries by paying close attention to the dynamics of motion. We 
became particularly interested in three types of motion that Delsarte 
identifies: 

Oppositions: 

Any two parts of the body moving in opposite directions simultaneously suggest 
expressive force, strength, physical or emotional power. For example, a rejecting 
motion of arm and hand is strengthened by an opposite motion of head and torso. 
(This brings to mind Newton's Third Law: For every action, there is an equal and 
opposite reaction.) 

Parallelisms: 

Any two parts of the body moving in parallel directions simultaneously suggest 
deliberateness, planning, intentionality. An example of this would be arms moving 
downward in parallel in a beat gesture. 

Successions: 

"Any movement passing through the body which moves each part [of the body in 
turn] (in a fluid wave-like motion)" (Shawn, 1963). True successions move from the 
face, through the torso and into the arms and legs and suggest sincerity and 
normality. Reverse successions work backwards from the limbs into the face and 
suggest falsity and insincerity. 
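
One way a virtual human designer might operationalize the first two orders of movement is to compare the movement directions of two body parts. The sketch below is our own illustrative reading, not a procedure given by Delsarte or Shawn: the function name, the use of cosine similarity and the 0.7 threshold are all assumptions. Successions, which concern the timing of movement through the body, are not covered.

```python
import math

def classify_pair(direction_a, direction_b, threshold=0.7):
    """Label two simultaneous movement directions (3D vectors).

    Uses cosine similarity: close to +1 means the parts move the same way,
    close to -1 means they move in opposite ways. The threshold is an
    arbitrary illustrative choice.
    """
    dot = sum(a * b for a, b in zip(direction_a, direction_b))
    norm_a = math.sqrt(sum(a * a for a in direction_a))
    norm_b = math.sqrt(sum(b * b for b in direction_b))
    if norm_a == 0.0 or norm_b == 0.0:
        return "neither"  # one of the parts is not moving
    cosine = dot / (norm_a * norm_b)
    if cosine >= threshold:
        return "parallelism"
    if cosine <= -threshold:
        return "opposition"
    return "neither"

# Both arms moving downward together, as in the beat gesture example.
print(classify_pair((0.0, -1.0, 0.0), (0.0, -1.0, 0.0)))
# Arm pushing forward while the torso pulls back, as in the rejecting motion.
print(classify_pair((0.0, 0.0, 1.0), (0.0, 0.0, -1.0)))
```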



3.0 Evaluation 

Delsarte's system appears at times to be full of interesting insights and at other times 
mired in metaphysics and a performance culture that is less relevant to the scientific 
study of modern gesture or acting. It is in the hope of extracting the insights and validating 
them that we have begun to experiment with Delsarte-based behaviors. As the focus 
of our first experiment, we selected Delsarte's idea of the imaginary cube as 
describing a space of hand attitudes, in part because it resonates with observations on 
how gestures often manipulate imaginary objects (McNeill, 1992). 



Stimuli 

The stimuli we constructed were animations of a virtual character that involved the 
hands moving from a rest pose (arms at the side of the body) to a position on an 
imaginary cube and then returning to the rest pose. 

The animations were crafted with an un-textured body that moved the hand to a 
position on an imaginary cube. Linear interpolation was used wherever possible, 
relying largely on simple interpolation between a few poses as opposed to 
hand-crafting a rich animation. This was done deliberately to avoid inferences from the 
appearance or physical manner of the animations. The internal cube face animations 
were designed so that it appears as if the hand was moving through the cube from the 
opposite face. Figures 2a and 2b depict the position of the hands on the imaginary 
cube. As can be seen, in some of the interior face stimuli the position of the hand is more 
consistent with having pushed the "cube" as opposed to resting the hand on the 
interior face. The animations used in this experiment can be found on the web: 
http://www.isi.edu/~marsella/experiment/. Note that subjects got no other context to 
guide their interpretations — the animations are silent, there is no text, sound or 
dialog. We chose to use animations, instead of stills, because human gestures have 
motion, but one can imagine a similar experiment using stills. 
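
The rest-to-cube-to-rest structure of the stimuli, built from linear interpolation between a few poses, can be illustrated with a short sketch. This is not the animation pipeline used in the study: the function names are hypothetical, and a pose is simplified here to a single 3D hand position with made-up coordinates, whereas a real character would interpolate many joint parameters.

```python
def lerp_pose(start, end, t):
    """Linearly blend two poses; t=0 gives start, t=1 gives end."""
    return tuple(a + (b - a) * t for a, b in zip(start, end))

def gesture_frames(rest, target, steps):
    """Frames for rest -> target -> rest, with `steps` intervals each way."""
    outward = [lerp_pose(rest, target, i / steps) for i in range(steps + 1)]
    # Skip i=0 on the way back so the target pose is not duplicated.
    back = [lerp_pose(target, rest, i / steps) for i in range(1, steps + 1)]
    return outward + back

# Hypothetical hand positions (x, y, z); the values are illustrative only.
rest_pose = (0.0, 0.0, 0.0)
cube_pose = (0.0, 1.0, 0.5)

frames = gesture_frames(rest_pose, cube_pose, steps=4)
for frame in frames:
    print(frame)
```

Because every in-between frame is a straight blend of two key poses, the motion carries little stylistic information of its own, which is the property the stimuli were designed to have.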

Hypothesis: 

Table 1 lists the relationship between faces of the cube, animation files and 
predictions. Note that we explored a superset of the faces described in writings about 
Delsarte's cube. As noted earlier, the authors included an animation of the hand 
moving to the nearest internal face that we predicted would be read as a statement 
about possession. We also included 3 additional cases of the hand pushing through the 
cube: the farthest-interior face, the bottom-interior face and the top-interior face. Our 
predictions for these three additional animations were that they would be the same as 
the corresponding animation that did not push through the cube. These additional 
predictions are marked by a plus sign. Also, Delsarte only discusses a single hand, 
whereas we also explored two-handed variants. A two-handed animation was not done 
for the side internal face since the hands would have crossed. Our predictions for the 
two-handed animations are that they would coincide with the single-handed case. 

Procedure: 

We presented these stimuli to subjects, broken into two groups. One group was shown 
animations that used only one hand and the other group was shown the gestures using 
both hands moving to positions on the cube. The order of presentation of animations 
was randomized. There were 28 subjects in the one-hand condition and 22 subjects in 
the two-handed condition. Subjects were recruited using Craig's List (a web-based 
job market). 



Face in relation to body   One Hand   Two Hand   Prediction
Farthest, exterior         FarExt     2FarExt    Limit
Nearest, interior          NearInt    2NearInt   Possess +
Nearest, exterior          NearExt    2NearExt   Stop
Farthest, interior         FarInt     2FarInt    Stop +
Side, exterior             SideExt    2SideExt   Possess
Side, interior             SideInt    None       Reject
Top, exterior              TopExt     2TopExt    Control
Bottom, interior           BotInt     2BotInt    Control +
Bottom, exterior           BotExt     2BotExt    Support
Top, interior              TopInt     2TopInt    Support +


Table 1: Relation between Cube Faces, Animations and Predictions 



To begin the experiment, subjects sat down at a computer interface that provided the 
following instructions: 

You will see videos of an animated character. Each video 
will show the character performing a gesture while 
interacting with someone off-camera. 

After each video, you will be asked to choose one phrase 
that best describes what the gesture conveys. 

After each animation, they were then given a forced-choice questionnaire (see 
Figure 1). The authors broke the interpretation of "Stop" into two variants, "Stop you" 
and "Stop it". Similarly, "Reject" was broken into two variants, "Reject it" and "Reject 
your idea". This was done to explore the assumption that some motions may have 
different referents, either an abstract idea or the listener. 



He appears to be expressing: 




Possession: "It's mine" 
Support: "I am going to support it" 
Control: "I am going to control it" 
Limit: "I am going to limit it" 
Stop: "I am going to stop you" 
Stop: "I am going to stop it" 
Reject: "I reject it" 
Reject: "I reject your idea" 

Click submit after making a selection: 



Figure 1: Questionnaire 



Results: 

Figures 2a and 2b show frequency distribution plots of subjects' preferred 
interpretations. Predicted responses are indicated with grey arrows. The white arrows 
are used to indicate our added predictions. In the case of Stop and Reject, where a 
prediction was broken into two variants, two arrows identify the predictions. 

With the exception of SideExt, these distributions are highly unlikely to be obtained 
by chance (Chi-square test for all 8 categories being equal, p < 0.05). This result is, 
however, not informative enough, since we were mainly interested in the proportions 
of subjects' responses on predicted categories and did not make any assumptions 
regarding the distributions of responses on other categories. 

For this reason we performed another series of tests to find out if the proportions 
of subjects' responses for individual categories were higher than the chance level. For 
each animation we aggregated the data into 2 groups, predicted and 
unpredicted, and used a Chi-square statistic to test if the proportions for predicted groups 
were above chance level. The predicted group contained either a single category or, in 
the case of Stop and Reject, two categories. We defined chance level as 0.125 (1 out 
of 8) for a single category, and 0.25 (2 out of 8) for the cases where 2 categories were 
combined together. The results are presented in Table 2 (left half). Observed 
proportions that were significantly higher than the chance level (p < 0.05) are in bold 
and marked with asterisks. 
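
The aggregated test described above can be sketched as a one-degree-of-freedom Chi-square goodness-of-fit test on two groups, predicted and unpredicted. This is a minimal illustration and not the authors' analysis script: the function name is hypothetical and the counts in the usage lines are invented for demonstration, not taken from the study. Because the Chi-square statistic is two-sided, the sketch also returns the observed proportion so that "above chance" can be checked separately.

```python
import math

def chi_square_vs_chance(n_predicted, n_total, chance):
    """Test whether the predicted group departs from the chance proportion.

    n_predicted: subjects who chose the predicted category (or categories)
    n_total:     all subjects who rated the animation
    chance:      0.125 for one category out of 8, 0.25 for two combined
    Returns (statistic, p_value, observed_proportion).
    """
    expected_pred = n_total * chance
    expected_rest = n_total * (1.0 - chance)
    n_rest = n_total - n_predicted
    statistic = ((n_predicted - expected_pred) ** 2 / expected_pred
                 + (n_rest - expected_rest) ** 2 / expected_rest)
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value, n_predicted / n_total

# Invented counts for illustration only: 14 of 28 subjects pick the predicted category.
stat, p, proportion = chi_square_vs_chance(14, 28, chance=0.125)
above_chance = p < 0.05 and proportion > 0.125
print(stat, p, proportion, above_chance)
```

With samples of 28 and 22 subjects the expected count in the predicted group is small (3.5 for a single category in the one-hand condition), so the Chi-square approximation is rough; an exact binomial test would be a natural alternative.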
Table 2: Proportions of subjects' responses for selected categories 



As one can see from the table, 11 out of 19 predictions are supported by our data. 
Note, however, that in several of the other cases the proportions for predicted 
categories were actually lower than the chance level (shown in italics); sometimes the 
predicted category happened to be the least frequently selected by the subjects. In 
those cases where the predicted category was not the most frequent one, we 
repeated the same analysis using the actual most frequent category instead of the 
predicted one. The results are presented in the right half of Table 2, and with a single 
exception, the subjects' preferences appear to be well above chance level. 

The plots and the results of statistical analysis reveal considerable consistency in 
people's responses with clearly preferred responses. Moreover, the predicted response 
is generally the preferred category. 

When the hand(s) end in a position where the palms face the virtual human (FarExt, 
2FarExt, NearInt and 2NearInt), possession is the preferred interpretation. In the case 
of NearInt and 2NearInt, this was predicted. Given the similar movement and ending 
position for all 4 of these gestures, it is not surprising that the preferred response for 
FarExt and 2FarExt is also possession, even though it was not predicted. 

When the palm faces away from the virtual human, on the face of the cube closest to 
it (NearExt and 2NearExt), there is a strong preference for a "stop" interpretation, as 
predicted. In the case of NearExt, there is a strong preference for "stop you" over 
other interpretations. In the case of 2NearExt, the responses are largely split between 
"stop you" and "stop it" (but as noted above, the Chi-square analysis aggregated 
these two responses). 

Table 3: Comparison of distributions for 1 vs. 2 handed animations 

A similar "stop" response is shown when the hand moves forward through the cube, in 
the FarInt and 2FarInt animations. Although here the responses are somewhat more 
spread over "stop you" and "stop it", the aggregated analysis shows a significant difference 
from the chance distribution. 

When the hand is on the side of the cube, there is more spread in the responses. SideExt, 
specifically, is the only animation where the differences between all responses are not 
significant. We discuss possible explanations in the discussion section. For 2SideExt, 
the responses depart from the chance distribution, but the most popular response 
("limit") is not significantly different from the others. 

In SideInt, the hand moving through the cube sideways, there is a strong preference 
for a combined "reject" interpretation (aggregating "reject it" and "reject your idea"), as 
predicted. TopExt's preferred response is "limit". The predicted response, 
"control", is, however, closely related conceptually. In the case of the hands on the bottom of the 
cube with palms up, BotExt and 2BotExt, there is a strong preference for a "support" 
interpretation, as predicted. Finally, a similar but even more prominent response is 
seen in the case where the hands move up through the cube, TopInt and 2TopInt. 

Finally, comparison between 1-handed and corresponding 2-handed animations shows 
that their response distributions are very similar. The results of a Chi-square test are 
presented in Table 3: the differences within each pair of animations are 
non-significant at the 0.05 level. 
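
A comparison of this kind can be sketched as a Chi-square test of homogeneity on a 2 x 8 table of response counts (one row per condition). The code below is an illustration under stated assumptions, not the analysis actually run for Table 3: the function name is hypothetical, the response counts are invented, and categories that nobody chose in either condition are dropped so that no expected count is zero. The critical values are the standard 0.05-level Chi-square values for 1 to 7 degrees of freedom.

```python
# Standard chi-square critical values at the 0.05 level, keyed by degrees of freedom.
CRITICAL_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488,
               5: 11.070, 6: 12.592, 7: 14.067}

def homogeneity_chi_square(row_a, row_b):
    """Chi-square statistic and degrees of freedom for two rows of counts."""
    # Drop categories that are empty in both conditions.
    columns = [(a, b) for a, b in zip(row_a, row_b) if a + b > 0]
    total_a = sum(a for a, _ in columns)
    total_b = sum(b for _, b in columns)
    grand_total = total_a + total_b
    statistic = 0.0
    for a, b in columns:
        column_total = a + b
        expected_a = total_a * column_total / grand_total
        expected_b = total_b * column_total / grand_total
        statistic += (a - expected_a) ** 2 / expected_a
        statistic += (b - expected_b) ** 2 / expected_b
    return statistic, len(columns) - 1

# Invented counts over the 8 response categories (28 and 22 subjects).
one_hand = [2, 14, 3, 2, 3, 2, 1, 1]
two_hand = [1, 12, 2, 2, 2, 1, 1, 1]

stat, df = homogeneity_chi_square(one_hand, two_hand)
print(stat, df, "significant" if stat > CRITICAL_05[df] else "non-significant")
```

As with the per-animation tests, many cells have small expected counts at these sample sizes, so the result of such a test should be read as indicative rather than exact.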




Figure 2a: Frequency distributions of subjects' interpretations 




Figure 2b: Frequency distributions of subjects' interpretations (continued) 



Discussion 



The results reveal considerable consistency in the subjects' interpretations. Further, the 
results are generally consistent with predictions. The results are particularly surprising 
given the minimal context the subjects were presented - they simply got these 
movements to interpret without other context or dialog. This suggests that Delsarte's 
cube may provide useful insight in how a virtual human's gestures can use physical 
space to convey meaning. And going beyond the cube, the current results suggest that 
perhaps the larger body of Delsarte's work is deserving of closer attention. 

In the particular experiments reported here, there is room for improvement. We chose 
to use animations instead of static poses of the hand resting on the cube faces. 
Obviously, movement conveys considerable meaning that can easily override the 
pose. This may in fact explain some of the results that were not consistent with 
Delsarte's predictions. It will be informative to repeat the study with different motions 
to the pose. Further, the lower consistency of responses to some animations may be 
due to the fact that the categories used in this experiment were synthesized by trying 
to find an intersection across the writings of several Delsarte experts. As a 
consequence, this synthesis ended up restricting the number of categories and more 
importantly the richness of the category descriptions. The fact that some animations 
did not have a strongly preferred interpretation may be an artifact of there being too 
few categories and limiting the possible interpretations. It may be informative and 
more useful to just use the categories as described by a single expert or alternatively 
to take a union of the interpretations across experts. 

Of course, consistent interpretation, free of any specific interactional context or 
dialog, is a strong test for nonverbal behavior which, by itself, is often ambiguous. In 
fact, as long as an observer decodes the behavior appropriately in an interactional 
context that is well defined, the behavior has potential utility for a virtual human 
designer. There also is the issue of evaluating how natural the gestures appear to the 
subjects, an issue that is distinct from but may correlate to a degree with consistency 
of interpretation. Finally, the animations used a single rotation of the hand with 
respect to the cube face and a single hand shape, and manipulation of these factors 
may influence the interpretation, as Delsarte argues. 

6. Conclusion 

The design of virtual humans is an interdisciplinary task. As a community, we all 
draw heavily on research in artificial intelligence, psychology, human ethology and 
linguistics, to name a few fields. We have also drawn on insight from the arts, 
including narrative theory, animation, dance, theatre and film. 

The preliminary work reported here attempts to go beyond just drawing insights from 
the arts. The knowledge and aesthetics acquired by the performance arts can provide a 
more systematic basis for the design of our virtual humans. We believe this to be a 
common goal shared by many in the virtual human community and see this work in 
the context of trying to help achieve that goal. 